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Abstract 

We consider the problem of estimating the support size of a discrete distribution whose 
minimum non-zero mass is at least p Under the independent sampling model, we show that 
the sample complexity, i.e., the minimal sample size to achieve an additive error of ek with 
probability at least 0.1 is within universal constant factors of log^ p which improves the 
state-of-the-art result of ^2 ^ in [VV13]. Similar characterization of the minimax risk is also 

obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its 
approximation-theoretic properties, which can be evaluated in 0{n + log^ k) time and attains 
the sample complexity within a factor of six asymptotically. The superiority of the proposed 
estimator in terms of accuracy, computational efficiency and scalability is demonstrated in a 
variety of synthetic and real datasets. 


1 Introduction 

1.1 Model 

Estimating the support size of a distribution from data is a classical problem in statistics with 
widespread applications. For example, a major task for ecologists is to estimate the number of 
species [FCW43] from field experiments; linguists are interested in estimating the vocabulary size 
of Shakespeare based on his complete works [McN73, ET76, TE87]; in population genetics it is 
of great interest to estimate the number of different alleles in a population [HWOl]. Estimating 
the support size is equivalent to estimating the number of unseen symbols, which is particularly 
challenging when the sample size is relatively small compared to the total population size, since a 
significant portion of the population are never observed in the data. 

We adopt the following statistical model [BO79,RRSS09]. Let P be a discrete distribution over 
some countable alphabet. Without loss of generality, we assume the alphabet is N and denote 
P = (pi,P 2 j ■ ■ • Given n i.i.d. samples X = (Xi,..., X„) drawn from P, the goal is to estimate 
the support size 

S(P) = E (1) 

i 

To estimate the distribution or its functionals, a sufficient statistic is the histogram of the samples, 
denoted by Y = {Ni , Y 2 ,...) and 

n 

". = El«..|- (2) 

i=i 

*The authors are with the Department of Electrical and Computer Engineering and the Coordinated Science Lab, 
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Therefore N has a multinomial distribution with parameter n and P. For estimating the support 
size (or other permutation-invariant functional of the distribution), the fingerprints form a sufficient 
statistic which is a further summary of the histogram N, which are defined as 

i 

i.e., the number of items that appear exactly j times. 

It is clear that unless we impose further assumptions on the distribution P, it is impossible to 
estimate S{P) within a given accuracy, for otherwise there can be arbitrarily many masses in the 
support of P that never occur in the samples with high probability and the risk for estimating S{P) 
is obviously infinite. To prevent the triviality, a conventional assumption [RRSS09] is to impose 
a lower bound on the non-zero probabilities. Therefore we restrict our attention to the parameter 
space which consists of all probability distributions on N whose minimum non-zero mass is at 
least consequently S{P) < k for any P G The decision-theoretic fundamental limit of this 
problem is given by the minimax risk: 

R*{k,n) ^ inf sup E[£(5,5)], (4) 

5 PeCfe 

where the loss function £{S, S) = is the normalized mean squared error (MSE) and S is an 

integer-valued estimator measurable with respect to the samples Xi,... 

1.2 Main results 

Our first main result is the following characterization of the minimax risk: 

Theorem 1. For all k,n>2, 


R*{k, n) = exp 




(5) 


Furthermore, if n <C A: log A:, as k ^ oo, 

-{V2e + < R*ik, n) < exp |'-(1.579 + 0 ( 1 ))^^^^ 


exp 


( 6 ) 


To interpret the rate of convergence in (5), we consider three cases: 


Simple regime n > A: log k: we have R*{k, n) = exp(—0(^)) which can be achieved by the simple 
plug-in estimator 

Sseen = l{Afi>0}5 ('^) 


that is, the number of observed symbols. Furthermore, if exceeds a sufficiently large 

constant, all symbols are present in the data and S'seen is in fact exact with high probability, 
namely, P[5seen / -S'] < E(5seen — S)'^ —)• 0. This can be understood as the classical coupon 
collector’s problem (cf. e.g., [MU05]). 
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Non-trivial regime n <C k\ogk\ In this case the samples are relatively scarce and the 

naive plug-in estimator grossly underestimate the true support size as many symbols are 
simply not observed. Nevertheless, accurate estimation is still possible and the optimal rate 

of convergence is given by R*{k,n) = exp(—This can be achieved by a linear 
estimator based on the Chebyshev polynomial and its approximation-theoretic properties. 
Although more sophisticated than the plug-in estimator, this procedure can be evaluated in 
0{n + log^ k) time. 

Impossible regime n < no consistent estimator exists. 

Next we discuss the sample complexity of estimating the support size, which is defined as follows: 


n*{k, e) = min{n > 0: 35, s.t. P[|5 — 5(P)| > ek] < 0.1, VP G Pfc}; (8) 

where S is an integer-valued estimator measurable with respect to the samples Xi,... ,X„''^'P. 
Clearly, since 5 — 5 is an integer, the only interesting case is e > with e = | corresponding to 
the exact estimation of the support size since |5 — 5| < 1 is equivalent to S = S. Furthermore, 
since S{P) takes values in [A:], n*{k,^) = 0 by definition. The next result characterizes the sample 
complexity within universal constant factors that are within a factor of six asymptotically. 

Theorem 2. Fix a constant cq < ^. For all \ < e < cq, 

Furthermore, i/ e —>■ 0 and e = k°^^\ as k ^ oo, 


1 + 0(1) fc 2 I 

2e^ log k e 


< n*{k, e) < 


1+0(1) k 2l_ 

2.494 log A: e 


( 10 ) 


Compared to Theorem 1, the only difference is that here we are dealing with the zero-one loss 
l|| 5 _ 5 |>eA:} instead of the quadratic loss (^^)^- In the proof we shall obtain upper bound for 
the quadratic risk and lower bound for the zero-one loss, thereby proving both Theorem 1 and 2 
simultaneously. Furthermore, the choice of 0.1 as the probability of error in the definition of the 
sample complexity is entirely arbitrary; replacing it by 1 — d for any constant 6 G (0,1) only affect 
n*{k,e) up to constant factors.^ 


1.3 Previous work 

There is a vast amount of literature devoted to the support size estimation problem. In paramet¬ 
ric settings, the data generating distribution is assumed to belong to certain parametric family 
such as uniform or Zipf [LP56, McN73, DR80] and traditional estimators, such as maximum like¬ 
lihood estimator and minimum variance unbiased estimator, are frequently used [Har68, MSJ82, 
Sam68, ET76, LP56, HWOl] - see the extensive surveys [BF93,GS04]. When difficult to postulate 
or justify a suitable parametric assumption, various nonparametric approaches are adopted such 
as the Good-Turing estimator [Goo53, Rob68] and variants due to Chao and Lee [Cha84, CL92], 

^Specifically, upgrading the confidence to 1 — <5 can be achieved by oversampling by merely a factor of log Let 
T = log I. With nT samples, divide them into T batches, apply the n-sample estimator to each batch and aggregate 
by taking the median. Then Hoeffding’s inequality implies the desired confidence. 
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Jackknife estimator [B079], empirical Bayes approach (e.g., Good-Toulmin estimator [GT56]), one¬ 
sided estimator [ML07]. Despite their practical popularity, little is known about the performance 
guarantee of these estimators, let alone their optimality. Next we discuss provable results assuming 
the independent sampling model in Section 1.1. 

For the naive plug-in estimator (7), it is easy to show (see Proposition 2) that to estimate S{P) 
within ±ek the minimal required number of samples is ©(/slog^), which scales logarithmically in 
^ but linearly in k, the same scaling for estimating the distribution P itself. Recently Valiant 
and Valiant [VVll] showed that the sample complexity is in fact sub-linear in fe; however, the 
performance guarantee of the proposed estimators are still far from being optimal. Specifically, 
an estimator based on a linear program that is a modification of [ET76, Program 2] is proposed 
and shown to achieve n*{k,e) < ^ 2 + 6 ^^^ ^ for any arbitrary J > 0 [VVll, Gorollary 11], which has 
subsequently been improved to ^2 i^g^ fo [VV13, Theorem 2, Fact 9]. The lower bound n*{k,e) > 
in [VVIO, Gorollary 9] is optimal in k but provides no dependence on e. These results show 
that the optimal scaling in terms of k is but the dependence on the accuracy e is which is 
even worse than the plug-in estimator. From Theorem 2 we see that the dependence on e can be 
improved from polynomial to polylogarithmic log^ which turns out to be optimal. Furthermore, 
this can be attained by a linear estimator which is far more scalable than linear programming on 
massive datasets (see the experiment on New York Times datasets of one billion words in Section 4). 

A closely related problem is the distinct elements problem, where the goal is to estimate the 
number of distinct colors based on repeated draws from in an urn consisting of k colored balls. 
For sampling with replacement, this can be viewed as a restricted case of the model in the present 
paper, where the distribution P = (pi) has the special form of p* = ^, with ki £ Z+ corresponding 
to the number of balls of the color and ki = k. The sample complexity under multiplicative 
error, that is, estimating S{P) within a factor of a has been shown to be 0(^) in [GCMNOOj. 
For additive error, that is, estimating S{P) within ±ek, a lower bound has been established in 

l_Q( j log log fc ■) 

[RRSS09], which, for constant e, scales as k A _ This, in turn, implies a lower bound for 

n*(A:, e), which is slightly suboptimal compared to the tight bound = k logfc . 

1.4 Organization 

The paper is organized as follows: In Section 2 we outline the proof for the lower bound part of 
Theorem 1 and 2 and the construction of the least favorable priors. In Section 3 we construct 
an estimator based on Ghebyshev polynomials which achieves the minimax risk and the sample 
complexity within constant factors. In Section 4 we apply our estimators to both synthetic and 
real data and compare the performance with existing methodologies. Proofs of the lower and upper 
bounds are given in Section 5 and 6, respectively. 

1.5 Notations 

For k £ N, let [k] = {1,..., A:}. The n-fold product of a distribution P is denoted by P®"-. Let 
Poi(A) denote the Poisson distribution with mean A whose probability mass function is denoted 
by poi(A,j) = —,j > 0. Given a positive random variable U, denote the Poisson mixture 

with respect to the distribution of C/ by E [Poi(?7)], whose probability mass function is given by 
jiK[U^e~^],j > 0. Let Bern(p) = p6i + (1 — p)6o denote the Bernoullili distribution. The total 
variation and the Kullback-Leibler divergence between probability measures P and Q are denoted 
by TV(P, Q) = 2 / jdP — dQ\ and D{P\\Q) = f dPlog^ respectively. We use standard big-O 
notations, e.g., for any positive sequences {a„,} and {b^}, an = 0{bn) or an ^ bn if Un < Cbn for 
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some absolute constant C > 0, = o{bn) or an <C bn or if liman/bn = 0. In order to extract non- 

asymptotic statements from asymptotic ones, we pay extra attention to o(l) terms. Specifically, 
we write 0^(1) as <5 —)• 0 to indicate convergence to zero that is uniform in all other parameters. 

2 Minimax lower bound 

The lower bound argument follows the idea in [LNS99, CLll, WY16] and relies on the generalized 
Le Cam’s lemma involving two composite hypothesis. In the following we illustrate the main idea 
for constructing a pair of priors that are near least favorable. 

Let A > 1. Given unit-mean random variables U and U' that take values in {0} U [1, A], define 
the following random vectors 

P = l{Ui,...,Uk), P' = i(17(,...,C/^), (11) 

where Ui and are i.i.d. copies of U and U', respectively. Although P and P' need not be 
probability distributions, as long as the standard deviation of U and [/' are not too big, the 
law of large numbers ensures that with high probability P and P' lie in a small neighborhood 
near the probability simplex, which we refer as the set of approximate probability distributions. 
Furthermore, the minimum non-zeros in P and P' are at least It can be shown that the minimax 
risk over approximate probability distributions is close to that over the original parameter space T>/. 
of probability distributions. This allows us to use P and P^ as priors and apply Le Cam’s method. 
Note that both S'(P) and 5'(P') are binomially distributed, which, with high probability, differ by 
the difference in their mean values: 

E[5(P)] - E[5(P')] = k{¥[U > 0] - ¥[U' > 0]) = k{¥[U' = 0] - ¥[U = 0]). 

If we can establish the impossibility of testing whether data are generated from P or P', the resulting 
lower bound is proportional to k{¥[U' = 0] — E[?7 = 0]). 

To simplify the argument we apply the Poissonization technique where the sample size is a Poi(n) 
random variable instead of a fixed number n. This provably does not change the statistical nature 
of the problem due to the concentration of Poi(n) around its mean n. Under Poisson sampling, 

the histograms (2) still constitute a sufficient statistic, which are distributed as Ni Poi(npi), as 
opposed to multinomial distribution in the fixed-sample-size model. Therefore through the i.i.d. 
construction in (11), ■E[Poi(^C/)] or E[Poi(^C/')]. Then Le Cam’s lemma is applicable if 

TV(E[Poi(|?7)]®*’,E[Poi(|^[/')]®^) is strictly bounded away from one, for which it suffices to show 

TV(E[Poi(n[//A:)],E[Poi(nU7A:)]) < (12) 

rC 

for some constant c < 1. 

The above construction provides a recipe for the lower bound. To optimize the ingredients it 
boils down to the following optimization problem (over one-dimensional probability distributions): 
Construct two priors U, U' with unit mean that maximize the difference E \U' = 0] — ¥[U = 0] 
subject to the total variation distance constraint (12), which, in turn, can be guaranteed by moment 
matching^ i.e., ensuring U and U' have identical first L moments for some large L, and the Loo-norms 
U, U' are not too large. To summarize, our lower bound entails solving the following optimization 
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problem: 


sup P[[/' = 0] - F[U = 0] 
s.t. E[U] = E[U'] = 1 

E[W]=E[U% j = l,...,L 
U,U' G {0}U [1,A]. 


The final lower bound is obtained from 13 by choosing L x log A: and A x 

In order to evaluate the infinite-dimensional linear programming problem 13, by considering its 
dual program we show (in Appendix A) that 13 coincides exactly with the best uniform approxi¬ 
mation error of the function x ^ over the interval [1, A] by degree-(L — 1) polynomials: 


inf sup- p{x) , 

P6Pl- 1 3;6[gA] X 


where Vl-i denotes the set of polynomials of degree L — 1 . The problem of best polynomial 
approximation has been well-studied, cf. [Tim63, DS08]; in particular, the exact formula for the 
best polynomial that approximates x i—)• - and the optimal approximation error have been obtained 
in [Tim63, Sec. 2 . 11 . 1 ], 

Applying the procedure described above, we obtain the following sample complexity lower 
bound: 

Proposition 1. Let 5 = and r = k ^ oo, S —)■ 0 and r — )• 0, 

n*{k,€) > (1 - 05 ( 1 ) -Ofc(l) -o^(l)) ^^^^^^^ log^ (14) 

Consequently, if -^ < e < ^ — c' for some constants c, c' then n*{k, e) > log^ 

The lower bounds announced in Theorems 1 and 2 follow from Proposition 1 combined with a 
simple two-point argument. See Section 5.2. 


3 Optimal estimator via Chebyshev polynomials 


In this section we prove the upper bound part of Theorem 1 and describe the rate-optimal support 
size estimator. Following the same idea as in the lower bound part, we shall apply the Poissonization 
technique to simplify the analysis where the sample size is Poi(n) instead of a fixed number n and 

hence the sufficient statistics N = {Ni,... ,Nifj Poi(npi). Analogous to (4), the minimax risk 
under the Poisson sampling is defined by 

R*{k,n) = \ni sup E[£(5,5)]. (15) 

5 PeCfe 


Due to the concentration of Poi(n) near its mean n, the minimax risk with fixed sample size is close 
to that under the Poisson sampling, as shown in the following lemma, which allows us to focus on 
the model using Poissonized sample size. 


Lemma 1. For any (3 < 1, 


R*{k,n) < 


R*{k,{l-P)n) 

1 — exp(—n/ 32 / 2 ) ’ 
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In the next proposition, we first analyze the risk of the plug-in estimator Sgeen, which yields the 
optimal upper bound of Theorem 1 in the regime of n > A: log k. This is consistent with the coupon 
collection intuition explained in Section 1.2. 

Proposition 2. For all n,k > 1, 

sup E{S{P) - Sseen{N)f < (16) 

PeCfc 


where N = (A^i, A^ 2 ) • • •) o,nd Ni ~ Poi(npj). 

Conversely, for P that is uniform over [k], for any fixed 6 G (0,1), if n < (1 — (5)/i:log then 
as k ^ oo, 

P[|5(P) - 4een(iV)| < ek] < (17) 

To specify the optimal estimator in the regime of n < klogk, we first introduce Chebyshev 
polynomials. Recall that the usual Chebyshev polynomial of degree L is 

Tl{x) = cos(Larccosx) = {z^ + z~^)l2, (18) 


where is the solution of the quadratic equation 2 ; + z~^ = 2x. Note that Tl is bounded in 
magnitude by one over the interval [—1,1]. The shifted and scaled Chebyshev polynomial over the 
interval [Z,r] is given by 


Pl{x) = 


Tl{^) 


L 


A 



m=0 


(19) 


which satisfies Pl( 0) = — 1 and hence oq = —1; the remaining coefficients oi,..., can be obtained 
from those of the Chebyshev polynomial [Tim63, 2.9.12] and the binomial expansion, or more 
directly, 


C\o} ^ _ (^Y 1 rf (-Gz) 

j! \r-j) jl ■ 


( 20 ) 


Let 


giij) 


ll’ 


j < L, 

j>L. 


( 21 ) 


Obviously (7 l( 0) = 0 since oq = — 1 by design. We Define our estimator by 


s = Y,9m). 

i 


( 22 ) 


We proceed to explain the reasoning behind the estimator (22). Note that the bias is E[5' — S'] = 
l{p->o}]. Since 9^(0) = 0 and griJ) = 1 for j > L, each term in the bias can be 

written as 


E [giiNi] - l{p,>0}] =IE [{gL{Ni) - l)l{p,>0}l{Ar,<L}] 


\ ^ -npi {npiy ajjl 


i=o 


(23) 


where Pl is the degree-L polynomial defined in (19). 
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Let 


r A I 1 ,1 A Cl log A: 

L=[cologA:J, r= -, 

n 




k' 


(24) 


where cq < ci are constants to be specified. The main intuition is that since cq < ci, then with high 
probability, whenever Ni < L = [cq log A:J the corresponding mass must satisfy pi < ^ . That 

is, if Pi > 0 and Ni < L then p* G [^, and hence PiiPi) is bounded by the sup-norm of Pl 

over the interval [^, In view of the property of Chebyshev polynomials [Tim63, Ex. 2.13.14], 

(19) is the unique degree-L polynomial that passes through the point (0, —1) and deviates the least 
from zero over [^, This explains the coefficients (21) which are chosen to minimize the bias. 

The next proposition gives an upper bound of the quadratic risk of our estimator (22): 


Proposition 3. Let cq = 0.558 and ci = 0.5. Ts 5 = 


k log k 


0 and k —)■ oo, 


sup E{S{N) - S{P))'^ < 4k‘^{l + Ofc(l)) exp -(2 -|- 05 ( 1 )) 

PeDfe \ 


n log k 


k 


(25) 


where N = (A^i, A^ 2 , • • •) ~ Poi(npi), and k = 2.494. 

The minimax upper bounds in Theorems 1 and 2 follow from combining Propositions 2 and 3. 
See Section 6.2. 

The estimator (22) belong to the family of linear estimators: 

5 = j;/(iV,) = (26) 

* i>i 

which is a linear combination of fingerprints hj’s defined in (3). Other notable examples of linear 
estimators include: 

• Plug-in estimator (7): 5seen = hi + h 2 + ■■■ ■ 

• Good-Toulmin estimator [GT56]: for some t > 0, 

<SgT = *§seen + thi — t^/l 2 -|- — t^/l 4 -|- . . . (27) 


• Efron-Thisted estimator [ET76]: for some t > 0 and J G N, 

J 

Set = Sseen + ^(-l)''+44,h„ 
J =1 


where bj = P[Binomial(J, l/(t -|- 1)) > j]. 

By definition, our estimator (22) can be written as 


(28) 


L 

S = '^gL{j)hj + hj. (29) 

1=1 j>L 

By (21), gi is also a polynomial of degree L, which is oscillating and results in coefficients with 
alternating signs (see Eig. 1). Interestingly, this behavior, although counterintuitive, coincide with 
many classical estimators, such as (27) and (28). 











Figure 1: Coefficients of estimator gi in (21) with cq = 0.45, ci = 0.5, k = 10® and n = 2 x 10®. 

Remark 1 (Time complexity). The evaluation of the estimator (26) consists of three parts: 

1. Construction of the estimator: O(T^) = 0{log'^ k), which amounts to computing the coeffi¬ 
cients fiij) per (20); 

2. Computing the histograms Ni and fingerprints hj: 0{n)] 

3. Evaluating the linear combination: 0{nAk), since the number of non-zero terms in the second 
summation of (26) is at most n Ak. 

Therefore the total time complexity is 0{n + log^ k). 

Remark 2. The technique of polynomial approximation has been previously used for estimat¬ 
ing non-smooth functions (L^-norms) in Gaussian models [INK87, LNS99, CLll] and more re¬ 
cently for estimating information quantities (entropy and power sums) on large discrete alphabets 
[WY16, JVHW15]. The design principle is to approximate the non-smooth function on a given in¬ 
terval using algebraic or trigonometric polynomials for which unbiased estimators exist and choose 
the degree to balance the bias (approximation error) and the variance (stochastic error). Note that 
in general uniform approximation by polynomials is only possible on a compact interval. There¬ 
fore, in many situations, the construction of the estimator is a two-stage procedure involving sample 
splitting: First, use half of the sample to test whether the corresponding parameter lies in the given 
interval; Second, use the remaining samples to construct an unbiased estimator for the approxi¬ 
mating polynomial if the parameter belongs to the interval or apply plug-in estimators otherwise 
(see, e.g., [WY16, JVHW15] and [CLll, Section 5]). 

While the benefit of sample splitting is to make the analysis tractable by capitalizing on the 
independence of the two subsamples, the downside is obviously sacrificing the statistical accuracy 
since half of the samples are wasted. In the present paper, to estimate the support size, we forgo the 
sample splitting approach and directly design a linear estimator. Instead of using a polynomial as 
a proxy for the original function and then constructing its unbiased estimator, the best polynomial 
approximation arises as a natural step in controlling the bias (see (23)). 

4 Experiments 

We evaluate the performance of our estimator on both synthetic and real datasets in comparison 
with popular existing procedures. In the experiments we choose the constants cq = 0.45, ci = 0.5 in 
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(24), instead of cq = 0.558 which is optimized to yield the best rate of convergence in Proposition 3 
under the iid sample model. The reason for such a choice is that in the real-data experiments the 
samples are not necessarily generated independently and dependency leads to a higher variance. 
By choosing a smaller cq, the Chebyshev polynomials have a slightly smaller degree, which results 
in smaller variance and more robustness to model mismatch. Each experiment is averaged over 50 
independent trials and the standard deviations are shown as error bars. 

Synthetic data We consider data independently sampled from the following distributions, (a) 
the uniform distribution with Pi = (b) Zipf distributions with pi oc and a being either 1 or 

0.5, (c) an even mixture of geometric distribution and Zipf distribution where for the first half of the 
alphabet pi oalji and for the second half pj_|_fc /2 oc (1 — f 1 < ^ The alphabet size k varies 
in each distribution so that the minimum non-zero mass is roughly 10“®. Accordingly, a degree-6 
Chebyshev polynomial is applied. Therefore, according to (29), we apply the polynomial estimator 
gL symbols appearing at most six times and the plug-in estimator otherwise. We compare 
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Figure 2: Performance comparison under four data-generating distributions. 

our results with the Good-Turing estimator [Goo53], the two estimators proposed by Chao and 
Lee [CL92], and the linear programming approach proposed by Valiant and Valiant [VV13]. Here 
the Good-Turing estimator refers to first estimate the total probability of seen symbols (sample 
coverage) by C = 1 — ^ then estimate the support size by 5 = Sseen/C. The plug-in estimator 
simply counts the number of distinct elements observed, which is always outperformed by the 
Good-Turing estimator in our experiments and hence omitted in the comparison. 

Good-Turing’s estimate on sample coverage performs remarkably well in the special case of 
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uniform distributions. This has been noticed and analyzed in [CL92,DR80]. Chao-Lee’s estimators 
are based on Good-Turing’s estimate with further correction terms for non-uniform distributions. 
However, with limited number of samples, if no symbol appears more than once, the sample coverage 
estimate C is zero and consequently the Good-Turing estimator and Ghao-Lee estimators are not 
even well-defined. For Zipf and mixture distributions, the output of Ghao-Lee’s estimators is highly 
unstable and thus is omitted from the plots; the convergence rate of Good-Turing estimator is much 
slower than our estimator and the linear programming approach, partly because it only uses the 
information of how many symbols occurred exactly once, namely hi, instead of the full spectrum 
of fingerprints the linear programming approach has similar convergence rate to ours but 

suffers large variance when samples are scarce. 

Real data Next we evaluate our estimator by a real data experiment based on the text of Ham¬ 
let, which contains about 32, 000 words in total consisting of about 4,800 distinct words. Here 
and below the definition of “distinct word” is any distinguishable arrangement of letters that are 
delimited by spaces, insensitive to cases, with punctuations removed. We randomly sample the 
text with replacement and generate the fingerprints for estimation. The minimum non-zero mass is 
naturally the reciprocal of the total number of words, 32 qqo • experiment we use the degree-4 

Chebyshev polynomial. We also compare our estimator with the one in [VV13]. The results are 
plotted in Fig. 3, which shows that the estimator in [VV13] has similar convergence rate to ours; 



Figure 3: Comparison of various estimates of the total number of distinct words in Hamlet. 

however, the variance is again much larger and the computational cost of linear programming is 
significantly higher than linear estimators, which amounts to computing linear combinations with 
pre-determined coefficients. 
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On a larger scale experiment we used the New York Times Corpus from the years 1987 ~ 2007.^ 
This corpus has a total of 25,020,626 paragraphs consisting of 996,640,544 words with 2,047,985 
distinct words. We randomly sample 1% - 50% out of the all paragraphs with replacements and 
feed the fingerprint to our estimator. The minimum non-zero mass is also the reciprocal of the 
total number of words, 1/10®, and thus the degree-9 Chebyshev polynomial is applied. Using only 
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Figure 4: Performance of our estimator using New York Times Corpus. 


20% samples our estimator achieves a relative error of about 10%, which is a systematic error 
due to the model mismatch: the sampling here is paragraph by paragraph rather than word by 
word, which induces dependence across samples as opposed to the iid sampling model for which 
the estimator is designed. For this large dataset the linear programming estimator has unbearable 
computational cost: Even for the data of a single year the linear programming takes over 100 
hours to compute on a server with E5-2623 CPU and 96 GB RAM; in contrast, the proposed linear 
estimator takes less than 15 minutes to run for the entire 20-year dataset on the same computer, 
which clearly demonstrates its computational advantage even if one factors into the difference that 
our implementation is based on C-|—|- instead of MATLAB used in [VV13]. 

Finally, we perform the classical experiment of “how many words did Shakespeare know”. We 
feed the fingerprint of the entire Shakespearean canon (see [ET76, Table 1]), which contains 31,534 
word types, to our estimator. We choose the minimum non-zero mass to be the reciprocal of the 
total number of English words, which, according to known estimates, is between 600,000 [ED] to 
1,000,000 [Mon], and obtain an estimate of 63,148 to 73,460 for Shakespeare’s vocabulary size, as 
compared to 66,534 obtained by Efron-Thisted [ET76]. 

^Data available at https://catalog.ldc.upenii.edu/LDC2008T19. 
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5 Proof of lower bounds 


5.1 Proof of Proposition 1 

Proof. For 0 < v < 1, define the set of approximate probability vectors by 


= {P = {pi,P2, ■■■) 




< u,pi e {0} u 


i + u 
k 



which reduces to the original probability distribution space Pk if = 0. Generalizing the sample 
complexity n*{k,e) in (8) to the Poisson sampling model over Pfc(z/), we define 

n*(A:, e, zz) = min{n > 0: 35, s.t. P[|5 — 5(P)| > efc] < 0.1, VP G Dfc(z/)}, (30) 


where S is an integer-valued estimator measurable with respect to N = (A^i, N 2 , ■ ■ ■) Poi(npi). 
The sample complexity of the fixed-sample-size and Poissonized model is related by the following 
lemma: 


Lemma 2. For any v G (0,1) and any e G (0, 

n*{k, e) > (1 — u)n*{k, e, z/) ( 1 — O 


1/(1 - u)n*{k,e,i') ^ 


(31) 


To establish a lower bound of n*{k,e,i'), we apply generalized Le Cam’s method involving 
two composite hypothesis. Given two random variables U, U' G [0, k] with unit mean we can 
construct two random vectors by P = j:{Ui,... ,Uk) and P' = ^(P(,..., P(,) with i.i.d. entries. 
Then E[5(P)] — E[5(P')] = k{F[U > 0] — E[P' > 0]). Furthermore, both 5(P) and 5(P') are 
binomially distributed, which are tightly concentrated at the respective means. We can reduce the 
problem to the separation on mean values, as shown in the next lemma: 

Lemma 3. Let U,U' G {0} U [1 -|- z^, A] be random variables such that E[P] = E[P'] = 1, E[P-^] = 
E[U'^] for j G [L], and |E[P > 0] — E[P' > 0]| = d. Then, for any a < 1/2, 


2 

/cz^2 ka^d^ 



< 0.6 ^ h* 


(1 — 2Q;)ci A 

’ 2 


> n. 


(32) 


Applying Lemma 5 in Appendix A, we obtain two random variables U, U' G {0} U [1 -|- zz. A] such 
that E[P] = E[P'] = 1, E[W] = E[U'^,j = 1,..., P and 


E[U > 0] - E[U' > 0] 


2Pl-i 


(i,[l + ^,A|) 



1 + ZZ 



A 


d, 


where the value of Pi_i(A, [1 -|- u, A]) follows from [Tim63, 2.11.1]. To apply Lemma 3 and obtain a 
lower bound of h*{k, e, n), we need to pick the parameters depending on the given k and e to fulfill: 


(1 — 2 a)d ^ 

2 - 

2A 2 , (enX\^ 


(33) 

(34) 
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Let 


L = [co log k \, 


A 


/ 7iog/c y 
Vlog(l/26)y' ’ 


n = C 


k 

log k 


log^ 


1 


V = \J ^/A7fe(l 


for some 00 , 7 , 6 * x 1 to be specified, and by assumption L, A —)• 00 , = 0 ^( 1 ), = Ot-(1) + 

Ofc(l), 1/A = 0 ^( 1 ) . Since d > ^ a sufficient condition for (33) is that 


1-2 


—r— I > 2e--— 77 — > 2 + Ot-(1) + 0^(1) + Ofc(l). 

A / 1 — 2a Co 


(35) 


Now we consider (34). By the choice of v and a, we have 

> y/X/k, a:^l/Vkd, 

since 1 — 2e , d > and e = Then the first two terms in (34) vanish. The last 

term in (34) vanishes as long as the constant C < . By the fact that 

•0<2co<7} = ^, 

the optimal C satisfying (35) is ^+°6W+or{i)+Ok{i) ^ Ti^ej-gfore, combining (33) - (34) and applying 
(32), we obtain a lower bound of fi* that 


n*{k, e, u) > 


1 + 05 ( 1 ) + Ot(1) + oa;(1) ^ I 2 1 

26^ log A; 2e’ 


Since 1 — 2e ;g> ; we have h*{k, e, v) S> '/k. Applying Lemma 2, we conclude the desired lower 

bound of n*(fc, e). □ 


5.2 Lower bound parts of Theorems 1 and 2 

Proof of lower bound of Theorem 2. The lower bound part of (10) follows from Proposition 1. Con¬ 
sequently, we obtain the lower bound part of (9) for ^ < e < cq for the fixed constant cq < 1/2. 

The lower bound part of (9) for ^ < e < ^ simply follows from the fact that e 1 —)> n*{k,e) is 
decreasing: 

n*(k,e) > n*(k,l/k'^) > k log k x ^ log^ -. □ 

log k e 

Proof of lower bound of Theorem 1 . By the Markov inequality, 

n*{k,e) > n R*{k,n) > O.le^. 


Therefore, our lower bound is 

R*{k,n) > supjO.le^ : n*{k,e) > n} = O.le^, 
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where e* = {e : n*{k, e) > n}. By the lower bound of n*{k, e) in (14), we obtain that 


e* > exp (^\/2e + 05(1) + oy(l) + Ofc(l)^ 

~ fciogfc —>■ 0, 5' = —>■ 0, and k —)• 00. Then we conclude the lower bound part of (6), 

which implies the lower bound part of (5) when n < k\ogk. 

For the lower bound part of (5) when n > A; log A:, we apply Le Cam’s two-point method [LC86] 
by considering two possible distributions, namely P = Bern(O) and Q = Bern(^). Then 

R*{Kn) > J(5(P) - S{Q)f eM-nD{P\\Q)) = ^ exp (^nlog (^1 - - 21ogA:) . 

Since n > A;logA;, we have nlog (l — j) — 2 log A; > □ 



5.3 Proof of lemmas 

Proof of Lemma 2. Fix an arbitrary P = {pi,P 2 , ■ ■ ■) £ Vk{y)- Let N = (A^i, N 2 , ..Poi(npj) 
and let n' = ^ iVj ~ Poi(n^pj) >s.t. Poi(n(l — v)). Let Sn be the optimal estimator of support 
size for fixed sample size n, such that whenever n > n*(k,e) we have P[|5n — *S'(P)| > eA:] <0.1 
for any P G We construct an estimator for the Poisson sampling model by S{N) = Sn'{N). 
We observe that conditioned on n' = m, iV ~ Multinomial(m, ). Note that y/’ G Pk by the 
definition of Pk{i'). Therefore 


S{N) - SiP) 


> ek 


E' 

m=0 


SUN) - s 


p 


HiPi 


> ek 


\ji' = m] 


< 0.1 P[n' > n*] + P[n' < n*] = 0.1 -|- 0.9 P[n^ < n*] 

< 0.1 + 0.9P[Poi(n(l — v)) < n*]. 


If ^ fo'^ > 0) then Chernoff bound (see, e.g., [MU05, Theorem 5.4]) yields that 

P[Poi(n(l — v)) < n*] < exp(—n*(/3 — log(l + /3))). 

By picking (3 = for some absolute constant (7, we obtain h* < ^ and hence the 

lemma. □ 


Proof of Lemma 3. Define two random vectors 



where Ui and Lf[ are i.i.d. copies of Lf and U', respectively. Conditioned on P and P' respectively, the 

corresponding histogram N = (A^i,..., Nk) Poi(nf4/A:) and N' = {N [,..., Poi(nL7/A;). 

Define the following high-probability events: for a < 1/2, 


E = 
E' = 


k 

T-U' 

i—ji I _ 1 

k 


< I/, |5(P) -E[5(P)]| < akdj , 

< i^, |5(P') -E [5(P')]| < akd 
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Now we define two priors on the set T)j^[v) by the following conditional distributions: 

= Pp\Ei = Pp'\E'- 

First we consider the separation of the support sizes under vr and vr'. Note that E[5(P)] = 
kF\U > 0] and E[5(P')] = kF\U' > 0], so |E[5'(P)] — E[5'(P')]| > kd. By the definition of the events 
E, E' and the triangle inequality, we obtain that under vr and vr', both P, P' G Ek{^) and 

|5(P)-5(P')| > (l-2a)A:d. (36) 

Now we consider the total variation distance of the distributions of the histogram under the 
priors tt and vr'. By the triangle inequality and the fact that total variation of product distribution 
can be upper bounded by the summation of individual one, 


P^{Pn\e,Pn'\e') < P^{Pn\e-,Pn) + '^^{Pn-,Pn') + P^{Pn'■,Pn'\ei) 

= E[^^] + TV (^(E[Poi(n^7/A:)])®^ {E[Poi{nU'/+ F[E‘ 
< F[E^] + F[E'^] + /cTV(E[Poi(nC//A:)],E[Poi(n[/7A:)]). 


(37) 


By the Chebyshev’s inequality and the union bound, both 

P[F;^],E[F;7 < E +P[|,S(P)-E[5(P)]| > afed] 




{ki^y 


{akdy 


kv'^ ka^d"^ ’ 


(38) 


where we upper bounded the variance of U by var[C/] < E[17^] < E[AC/] = A. 

Applying the total variation bound for Poisson mixtures in Lemma 6 (see Appendix B) yields 
that 


TV(E[Poi(nl7/A:)],E[Poi(nt/7A;)]) <(^ 

Y ZkLj 

Plugging (38) and (39) into (37), we obtain that 

N 2A 2 /enAA^ 

Applying Le Cam’s lemma [LC86], the conclusion follows from (36) and (40). 


(39) 

(40) 

□ 


6 Proof of upper bounds 

6.1 Proof of Propositions 2 and 3 

Proof of Proposition 2. First we consider the bias: 

|E(4een(P) " S{P))\ = J](l - E(iV, > l))l{p^> l| 

i i 

< kexp{—n/k). 
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The variance satisfies 


var[4een(^’)] = X]''^''^{^*>0}“ X]^{pi>i} “ ^exp(-n//c). 
i i 


The conclusion follows. 

For the negative result, under the Poissonized model and with the samples drawn from the 
uniform distribution, the plug-in estimator .Sseen is distributed as Binomial(/c, 1 — If n < 

(1 — (5)fclog ^ < A; log then 1 — < 1 — e. By the Chernoff bound, 

P[|4een - S{P)\ < ek] = P[Binomial(A:, 1 - e”’"/'') > (1 - e)A:] 

g-A:d(l-e||l-e-"/'“) _ g-fcd(e||e-"/'=) 


where d{p\\q) = plog ^ + [1 — p) log is the binary divergence function. Since e '^ > e, 

d{t\\e-^/'^) > d(e||ei-‘^) > d{k-^\\k-^+^) x k-^+^, 

where the middle inequality follows from the fact that e i—)> d(e||e^“'^) is increasing near zero. 
Therefore P[|5seen — 5'(T’)| < efc] < exp(—n(A:'^)). □ 

Proof of Proposition 3. First we consider the bias. Recall that L = [cq log k\,r = ^ , I = \- By 

(23) the bias of S is the summation of 

b{pi) = e-^P^PL{pi)l{p^>o}. 

Obviously 6(0) =0. If / < x < r then |Pl(x)| < by the design of Pl in (19). 

Therefore |6(x)| < if < 1, 


|6(x)| < max e 
r<a:<l 


1 ^l(x)| = 


l<y< 


max 

2-r-l 


exp(-nr(l - 5)y/2)TL{y) 


exp(—nr(l -|- 6)/2) 




(41) 


We need the following lemma: 
Lemma 4. If 13 = 0{L), then 


maxe-^*rL(x) 

X>1 



p + 1 
yT+lTo^ 


(l+w(l)) 


L 


oo. 


(42) 


where a = ^. 

Applying Lemma 4 to (41) with L = [colog/cj, (3 = nr(l — 5)/2 and a = 2p -|- o(l) where 
p A co/ci, we obtain that 


l&(a;)l < 2 


1 (2p + ^{2py + i 


^y/i+i/(2pf 

2p -|- -^/(2pp~+T 

Vl+l/(2p)Wl/(2p) 


(1 -I- Ofc(l)) 


exp(-^(l + 05(l))) 


\ L 


(1 + Ofc(l) -|- 0,5(1)) 


1 




17 



















Therefore b{pi) is uniformly bounded by as long as we pick the constant p such that 

Ii:r5) I 

—— < 1 or equivalently, p < p* ^ l.l. Then the bias of S is at most 

eVi+i/{2p)2+i/(2p) ’ ^ r r 

|E[5 - g|| < ^ ^ 2t(l + »t(l) + °i(l)) (1 - '' 




= 2k{l + Ofc(l)) exp -(1 + os{1))\Ucqp 


i + Vs, 


nlogk 

~V~ 


(43) 


Now we turn to the variance of S: 


var[;§] = [{gL{Ni) - l)l{Ari<L}] 

i:pi>0 

< Y, ^[{9L{m-l)\N.<L}] 


i:pi>0 


EE 


ajjl\ ^_r,pAnpiy 




f- 


E 


rP 




i:pi>0j=0 ^ 2 j^Q 

where hj = J2i ^{Ni=j} is the fingerprint of samples. By definition j/ij] = n. Therefore 


var[5] < E[ho] + ^ 


^ {ajjl/n^y {cLjjl/ 

—-—^- ¥,[jhj\ <k + nL max ^ 


n- 




i=i 


^<j<L j 


(44) 


Recall the polynomial coefficients aj given in (20): 


Qi 7 — 


2 1 


' \r-ij j\ 

Applying Markov brothers’ inequality [Mar92] on the scaled interval [—^]) we obtain that 

-u p\ 

(45) 


J'- 


rh> 


?! 

< 


k 2 V l 2^j! L (L + j 


nJ \r — I J jl I L + j \ 2j J \n{r + 1) J L + j \ 2j 


L (L + j 


We use the following bound on binomial coefficients [Ash65, Lemma 4.7.1]: 

O 


TT 


< 


(27rnA(l — A ))“^/2 exp(n/i(A)) 


< 1 . 


(46) 


where A = ^ G (0,1) and /i(A) = —A log A — (1 — A) log(l — A) denotes the binary entropy function. 
Therefore, from (45) and (46), for j = 1,..., L — 1, 


Ojl 


< 


^ . L exp((L + 7 X;gj)) 


7- 

n{r + 1) J L + j 




- (:t) 2 


(47) 
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where we used the fact that maxiprr_ii , ^ ^ for L > 2. From (45), the 

‘ V47ri(L-i)(L+i) y/ML^-l) - 2 - ^ 

upper bound (47) also holds for j = L. Using (47) and Striling’s approximation that n! < ey/n{^)'^, 


(ajjl/ 


n- 


il2 


< - 
j 


Cl log k 


2i 


eV7 


2i 


exp 


[2iL + j)h 


2j 


i + j 


_ ^^2co(;31og i^ + (l+/3)h(j^)) ^ ^ j ^2 cot ( p ) 

4 “ 4 ’ 


(48) 


where /? = jjL and r(p) = sup^gjg i](/31og ^ + (1 + I3)h{^^)), which occurs at /3 = 

Note that from (43) the squared bias of S is from (44) and (48) the variance of S is at 

most 

p2 

var[5] <k + —, (49) 

which is ^/ji+2coT(p)+oi(i)_ Therefore as long as we pick constant cq such that 2 cot(p) < 1 the 
variance of S in (49) is lower order than the squared bias of S in (43), and thus the MSE of S is at 
most 

E(5 - Sf < 4A:2(1 + Ofc(l)) exp ^-2(1 + 05(1)) 

The conclusion follows from the fact that sup^^^* 2p/T{p) « 2.494, which corresponds to choosing 
Co ~ 0.558 and ci = 0.5. □ 


I 2p nlogk 
t{p) k 


6.2 Upper bound parts of Theorems 1 and 2 

Proof of upper bound of Theorem 1. Combining Lemma 1 and Proposition 3 yields the upper bound 
part of (6), which also implies the upper bound of (5) when n < k\ogk. The upper bound part of 
(5) when n > A: log A: follows from Proposition 2. □ 

Proof of upper bound of Theorem 2. By the Markov inequality, 

R*{k, n) < O.le^ ^ n*{k, e) < n. (50) 


Therefore our upper bound is 

n*{k,e) < inf{n : R*{k,n) < O.le^}. 


By the upper bound of R*{k, n) in (25), we obtain that 


n*{k, e) < 


1 + oy (1) + Oe(l) + Ofc(l) A: 2I 

--rlog - 

K log k e 


g^g J/ A '°gUA) A 0, e —>■ 0, and k —)• 00. Consequently, we obtain the upper bound part of (9) when 
^ < e < Co for the fixed constant cg < 1/2. 

The upper bound part of Theorem 2 when ^ < e < follows from the monotonicity of 
e I—)• n*{k, e) that 

n*(k,e) < n*(k,l/k) < 3k log k x , ^ , log^ -, 

log k e 

where the middle inequality follows from Proposition 2 and (50). □ 
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6.3 Proof of lemmas 


Proof of Lemma 1. We follow the same idea as in [WY16, Appendix A] using the Bayesian risk 
as a lower bound of the minimax risk with a more refined application of the Chernoff bound. We 
express the risk under the Poisson sampling as a function of the original samples that 

R*{k,{l — j5)n) = inf sup 5(P))], 

{5^} PeCfe 

where n' ~ Poi((l — /3)n). The Bayesian risk is a lower bound of the minimax risk: 


R*{k, (1 — /3)n) > sup inf E[£(5„/, S'(P))], 

TT {Sm} 

where tt is a prior over the parameter space For any sequence of estimators {^m}, 

n 

E[£(5„q5)] = ^ E[£(5^,5)]P[n' = m] > E[£(S™, 5)]E[n' = m]. 


(51) 


m>0 


m=0 


Taking infimum of both sides, we obtain 

n n 

inf E[£(5„/, S)] > inf V E[i{Sm, S)]P[n' = m] = V mfE[£{Sm, 5)]P[n' = m], 

{Sm} Sm 

Note that for any fixed prior vr, the function m i—)• inf^ E[.^(S'm, S)] is decreasing. Therefore 
inf E[£(5„/,S)] > infE[£(5n,5)]E[n' < n] 

{Sm} Sn 

> infE[£(5n,5')](l - exp(n(/3 + log(l -/3)))) 

Sn 

> infE[^(,Sn,5)](l - exp(-n,d^/2)), 

Sn 


(52) 


where we used the Chernoff bound (see, e.g., [MU05, Theorem 5.4]) and the fact that log(l — x) < 
—X — x‘^/2 for X > 0. Taking supremum over tt on both sides of (52), the conclusion follows from 
(51) and the minimax theorem (cf. e.g. [Str85, Theorem 46.5]). □ 

Proof of Lemma 4- By assumption, a = ^ is strictly bounded away from zero. Let /(x) = 
cosh(L arccosh(x)) when x > 1. By taking the derivative of /, we obtain that / 


-Pxrp, 


— p-Sx 


L{x) = e 
is decreasing if and only if 


tanh(L arccosh(x)) tanh(Ly) 1 

Vx2 - 1 sinh(y) a 

where x = cosh(y). Let g{y) = • Note that g is strictly decreasing on ]R_|_ with g{0) = L and 

g{oo) = 0. Therefore / attains its maximum at x* which is the unique solution of = 

It is straightforward to verify that the solution satisfies x* = \/l + a'^(l — ol{1)) when a is 
strictly bounded away from zero. Therefore the maximum value of / is 

e-^^*TL{x*) = 

where we used (18) and z = x* + \/x*‘^ — 1 = (Vl + + a)(l — ol{V)) is strictly bounded away 

from 1. This proves the lemma. □ 
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A Dual program of (13) 

Define the following infinite-dimensional linear programming problem: 

= sup F[U' = 0] - F[U = 0] 
s.t. E [D] = E [17'] = 1 

E[W]=E[U'^], j = l,...,L + l, 

U, U' G {0} U I, 

where I = [a, b] with 5 > a > 1. Then (13) is a special case of (53) with I = [1, A]- 
Lemma 5. = infpg-p^ sup^jg/ | ^ — p{x) \. 

Proof. We first show that 13 coincides with the following optimization problem: 

S 2 — sup E 

s.t. E [X^] = E , j = l,...,L, 

X,X' E I. 

Given any feasible solution U, U' to 13, construct X, X' with the following distributions: 

Pv(dx) = xPu{dx), 

Px'idx) = xPij>{dx), 

It is straightforward to verify that X, X' are feasible for (54) and 

= E [[/' = 0 ] - E [[/ = 0]. 


£^>E 


1 

X 


-E 



( 53 ) 


(54) 


(55) 


Therefore £2 > £ 1 - 

On the other hand, given any feasible X, X' for (54), construct U, U' with the distributions: 


Pu{du) = (^1 — E 
Pij! (du) = ( 1 — E 


1 

X 
■ 1 


1 


5o{du) + -Pxidu) 


u 


X' 


(56) 


5o(du) + -Px'{du), 
u 


which are well-defined since X, X' > 1 and hence E [^] < 1,E [y] < 1. Then U,U' are feasible 
for 13 and hence 

£{ > E [[/' = 0] - E [G = 0] = E 

Therefore P £ 2 - Finally, the dual of (54) is precisely the best polynomial approximation problem 
(see, e.g., [WY16, Appendix E]) and hence 


1 

X 


-E 


1 

Y' 


£\ = £% = inf sup 

peT’i ^6/ 


— vY 

X 


□ 
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B Total variation between Poisson mixtures 


The total variation distance between two Poisson mixtures is obtained in the following lemma, which 
is an improvement of [WY16, Lemma 3] in terms of constants. This is crucial for our purposes of 
obtaining the best constants in the sample complexity bounds in (10). 

Lemma 6. Let V and V be random variables taking values on [0, A]. //E[l/'^] = j = 

1 ,..., L, then 

TV(E[Poi(V)],E[Poi(P')]) < ^ 2^/(2iog2)-L^ ^ ^57^ 

In particular, TV(E[Poi(P)], E[Poi(P')]) < Moreover, if L > |A, then 

TV(E[Poi(P)],E[Poi(P')]) < (1 + 0(1)), A^oo. 

Proof. Denote the best degree-L polynomial approximation error of a function / on an interval I 
by 

EL{f,I)= inf sup|/(x)-p(x)|. 
peT’i xe/ 

Let 

p ^ qr'j 

fjix) ^ ( 58 ) 

Let Pf j be the best polynomial of degree L that uniformly approximates fj over the interval [0, A] 
and the corresponding approximation error by EL{fj, [0, A]) = max^-gjo^A] \fj{x) — PLji^)\- Then 
EPf jiV) = EPl j{V') and hence 

^ OO 

TV(E[Poi(P)],E[Poi(V')]) = -^|E/,(V) -E/,-(P')l 

1=0 
^ OO 

< 2 E + Wjiy') - p*LAy'))\ 

1=0 

OO 

< Y,EL{fj,[0,A\). ( 59 ) 

1=0 


A useful upper bound on the degree-L best polynomial approximation error of a function / is 
via the Chebyshev interpolation polynomial, whose uniform approximation error can be bounded 
using the L*'^ derivative of /. Specifically, we have (cf. e.g., [Atk89, Eq. (4.7.28)]) 


EUf, [0, A]) < niax \fj{x) - Ql(/;x)| < 
a:e[0,A] 


1 

2^(L + 1)! 



L+l 


max 

3;e[0,A] 


fy+^\x) , 


(60) 


where QL{f;x) denotes the degree-L interpolating polynomial for / on Chebyshev nodes (roots 
of the Chebyshev polynomial). To apply (60) to / = fj defined in (58), note that can be 

conveniently expressed in terms of Laguerre polynomials: Denote the degree-n generalized Laguerre 
polynomial by L^f^ and the simple Laguerre polynomial by Ln{x) = Ln'^ . Recall the Rodrigues 
representation for Laguerre polynomials: 



X ^e^ d”' 

n! dx'^ 


(e-^x’^+^) 


(-i)‘ 


d* 

dk^ 


Efi+k (®)) 


k€N. 
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If j <L + 1 






d^+i-i 




-{Hx)e-^). 


d2;L+l-i ydx-? j! 

Note that Lj is a degree-j polynomial, whose derivative of order higher than j is zero. Applying 
general Leibniz rule for derivatives yields that 

/ T , ^ ■\ jrn T f \ 

/y+‘){x)= 

m =0 ' '' 


{L+l-j)Aj 


= (- 1 ) 


L+l-j^-x 


E 

m=0 


Applying [AS64, 22.14.13] 




n-\- k 


n 




„x /2 


when X > 0 and A: G N, we obtain that 

{L+l-j)Aj 

< P-^ 


Therefore max^-gjo^A] when j < L + 1.^ Then, applying (60), we have 

VEff (0 An < V= WAT 
1 )S^ 2 i(L+l)! (t+ 1 )! ■ 

If j > L + 2, the derivatives of fj are related to the Laguerre polynomial by 


E 

m=0 

/L +1 


L + l-j 
m 


J 

J — m 


^xl2 ^ g-x/2 (L + 1 


)■! 


(61) 


(62) 


(63) 


Again applying (62) when x > 0 and A: G N, we obtain 


/f+')(x) 


< 




3 

L + l 


e "/2 = 




g-x/2^j-L-l 


where the maximum of right-hand side on [0, A] occurs at x = (2(j — L — 1)) A A. Therefore 


max 

3;e[0,A] 




nj-L-l) 


j-L-l 


_1_g-A/2Aj-L-l 

(i-L-1)!^ A 


, L + l<j<L + l + A/ 2 , 
J > L + 1 -|- Aj2. 


Then, applying (60) and Stirling’s approximation that (A_E_Lii ^ i < 0 ^ ^ have 


E ^A/A0.Al)<)4lA;i 


2 ^ 


-L-l 


(A/2)^+i2^/2 

< ^ G /_ -r—. (64) 


j'>L-\-2 


2 '-(i+l)! -L-l)- 2 l(L+l)l 

j<L+lH-A/2 


^ i^A(/,-,[ 0 ,A])< 

j>L+l+A/2 


(A/ 2 ) 


L+l p—A/2 
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A^ 


)-L-l 


< 


(A/2) 


L+lpA/2 




(65) 


■^This is in fact an equality. In view of (61) and the fact that [AS64, 22.3], we have = 

'm V m / \i—m/ Vi/ 
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Assembling the three ranges of summations in (63)-(65) in the total variation bound (59), we obtain 


TV(E[Poi(y)],E[Poi(y')]) < h + + 2 ^/( 2 iog 2 )-L\ ^ 

[L + 1)! V / 

Finally, applying Stirling’s approximation (A+l)! > -y/27r(L + , we conclude TV(E[Poi(P)], E[Poi('F' 

{^)^. If L > f A > 21 ^ > t, then 2^/^-^ + 2A/(2iog2)-i = o(l). □ 

Acknowledgment 

This work was completed in part when Y.W. was visiting the Simons Institute for the Theory of 
Computing, whose generous support is acknowledged. The authors thank Luca Trevisan for helpful 
comments pertaining to Theorem 2. The authors are grateful to Dan Roth and Mark Sammons for 
help with the datasets used in Fig. 4. 


References 


[AS64] 

[Ash65] 

[Atk89] 

[BF93] 

[B079] 

[CCMNOO] 

[Cha84] 

[CL92] 

[CLll] 

[DR80] 


Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with 
formulas, graphs, and mathematical tables. Courier Corporation, 1964. 

Robert B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965. 

Kendall E Atkinson. An introduction to numerical analysis. John Wiley & Sons, 1989. 

John Bunge and M Fitzpatrick. Estimating the number of species: a review. Journal 
of the American Statistical Association, 88(421) :364-373, 1993. 

Kenneth P Burnham and W Scott Overton. Robust estimation of population size when 
capture probabilities vary among animals. Ecology, 60(5):927-936, 1979. 

Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards 
estimation error guarantees for distinct values. In Proceedings of the nineteenth ACM 
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 
pages 268-279. ACM, 2000. 

Anne Chao. Nonparametric estimation of the number of classes in a population. Scan¬ 
dinavian Journal of statistics, pages 265-270, 1984. 

Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. 
Journal of the American statistical Association, 87(417):210-217, 1992. 

T.T. Cai and M. G. Low. Testing composite hypotheses, Hermite polynomials and 
optimal estimation of a nonsmooth functional. The Annals of Statistics, 39(2): 1012- 
1041, 2011. 

JN Darroch and D Ratcliff. A note on capture-recapture estimation. Biometrics, 
36:149-153, 1980. 


[DS08] 


Vladislav K Dzyadyk and Igor A Shevchuk. Theory of uniform approximation of func¬ 
tions by polynomials. Walter de Gruyter, 2008. 


24 





[ED] 

[ET76] 

[FCW43] 

[Goo53] 

[GS04] 

[GT56] 

[Har68] 

[HWOl] 

[INK87] 

[JVHW15] 

[LC86] 

[LNS99] 

[LP56] 

[Mar92] 

[McN73] 

[ML07] 

[Mon] 


Oxford English Dictinary. http://public.oed.com/about/. Accessed: 2016-02-16. 

B. Efron and R. Thisted. Estimating the number of unseen species: How many words 
did Shakespeare know? Biometrika, 63(3):435-447, 1976. 

Ronald Aylmer Eisher, A Steven Gorbet, and Carrington B Williams. The relation 
between the number of species and the number of individuals in a random sample of 
an animal population. The Journal of Animal Ecology, pages 42-58, 1943. 

Irving J Good. The population frequencies of species and the estimation of population 
parameters. Biometrika, 40(3-4):237-264, 1953. 

Alberto Gandolh and C.C.A. Sastri. Nonparametric estimations about species not 
observed in a random sample. Milan Journal of Mathematics, 72(1):81-105, 2004. 

I. J. Good and G.H. Toulmin. The number of new species, and the increase in population 
coverage, when a sample is increased. Biometrika, 43(l-2):45-63, 1956. 

Bernard Harris. Statistical inference in the classical occupancy problem unbiased esti¬ 
mation of the number of classes. Journal of the American Statistical Association, pages 
837-847, 1968. 

Shu-Pang Huang and BS Weir. Estimating the total number of alleles using a sample 
coverage method. Genetics, 159(3): 1365-1373, 2001. 

LA. Ibragimov, A.S. Nemirovskii, and R.Z. Khas’minskii. Some problems on nonpara¬ 
metric estimation in gaussian white noise. Theory of Probability & Its Applications, 
31(3):391-406, 1987. 

Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation 
of functionals of discrete distributions. IEEE Transactions on Information Theory, 
61(5):2835-2885, 2015. 

L. Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New 
York, NY, 1986. 

Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the norm 
of a regression function. Probability theory and related fields, 113(2):221-253, 1999. 

Richard C Lewontin and Timothy Prout. Estimation of the number of different classes 
in a population. Biometrics, 12(2):211-223, 1956. 

VA Markov. On functions of least deviation from zero in a given interval. St. Petersburg, 
892, 1892. 

Donald R McNeil. Estimating an author’s vocabulary. Journal of the American Statis¬ 
tical Association, 68(341) :92-96, 1973. 

Chang Xuan Mao and Bruce G Lindsay. Estimating the number of classes. The Annals 
of Statistics, 35(2):917-930, 2007. 

Global Language Monitor. Number of words in the english language. http://www. 
languagemonitor.com/?attachment_id=8505. Accessed: 2016-02-16. 


25 


[MSJ82] 

[MU05] 

[Rob68] 

[RRSS09] 

[Sam68] 

[Str85] 

[TE87] 

[Tim63] 

[VVIO] 

[VVll] 

[VV13] 

[WY16] 


JP Marchand and FE Schroeck Jr. On the estimation of the number of equally 
likely classes in a population. Communications in Statistics-Theory and Methods, 
11(10):1139-1146, 1982. 

Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomized algo¬ 
rithms and probabilistic analysis. Cambridge University Press, 2005. 

Herbert E Robbins. Estimating the total probability of the unobserved outcomes of an 
experiment. The Annals of Mathematical Statistics, 39(l):256-257, 1968. 

Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds 
for approximating distribution support size and the distinct elements problem. SIAM 
Journal on Computing, 39(3);813-842, 2009. 

Ester Samuel. Sequential maximum likelihood estimation of the size of a population. 
The Annals of Mathematical Statistics, 39(3):1057-1068, 1968. 

Helmut Strasser. Mathematical theory of statistics: Statistical experiments and asymp¬ 
totic decision theory. Walter de Gruyter, Berlin, Germany, 1985. 

Ronald Thisted and Bradley Efron. Did Shakespeare write a newly-discovered poem? 
Biometrika, 74(3);445-455, 1987. 

Aleksandr Filippovich Timan. Theory of approximation of functions of a real variable. 
Pergamon Press, 1963. 

Gregory Valiant and Paul Valiant. A CLT and tight lower bounds for estimating 
entropy. In Electronic Colloquium on Computational Complexity (ECCC), volume 17, 
page 179, 2010. 

Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estima¬ 
tor for entropy and support size, shown optimal via new CLTs. In Proceedings of the 
43rd annual ACM symposium on Theory of computing, pages 685-694, 2011. 

Paul Valiant and Gregory Valiant. Estimating the unseen: Improved estimators for 
entropy and other properties. In Advances in Neural Information Processing Systems, 
pages 2157-2165, 2013. 

Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alpha¬ 
bets via best polynomial approximation. IEEE Transactions on Information Theory, 
62(6):3702-3720, 2016. 


26 



